Show the code
source("utils.R")This website is still under active development - all content subject to change
spe <- readRDS("../data/spe.rds")
#define the Z-stacks that you want to compare
zstack_list <- list("-0.09", "0.01", "0.21")
#define the celltype that you want to compare across the stacks - hereby we assume independence across the z-stacks which is an assumption that can be challenged
celltype_ls <- "OD Mature"
selectZstacks <- function(zstack, spe){
sub <- spe[, spe$sample_id == zstack]
pp <- .ppp(sub, marks = "cluster_id")
return(pp)
}
pp_ls <- lapply(zstack_list, selectZstacks, spe)
names(pp_ls) <- zstack_listThe theory of spatial point patterns is discussed in great detail in (Baddeley, Rubak, and Turner 2015). The book has an accompanying package called spatstat which offers great functionality to the theoretical concepts described in the book (Baddeley and Turner 2005). This chapter relies heavily on both publications.
In point pattern analysis we assume that the patterns we observe are a realisation of a stochastic process called a point process. The inferences we make about the point pattern are based on the point process. For example, if a pattern is created by a Poisson point process it will be evenly distributed in the observation window (Baddeley, Rubak, and Turner 2015, 127).
When considering a pattern with \(m\) multiple types, as we do in the (Moffitt et al. 2018) dataset, there are two very closely related concepts. One can view the pattern as a multitype point pattern, where all the points are sampled from the same point process. The other option is to consider the pattern as a multivariate point pattern, where the points come from \(m\) distinct point processes. The difference between these two views is that in the multitype framework we assume the points to stem from the same point process. In the multivariate framework we assume that the types stem from distinct point processes and therefore we can consider dependencies of one type alone. Whether or not the patterns stem from the same point process depends on the biological question. If we analyse two cell types in one slice of a tissue, we should consider them as being sampled from one point process. However, if we consider the distribution of a cell type in two slices of the same tissue we can have grounds to consider them as distinct processes (Baddeley, Rubak, and Turner 2015, 564–65).
The most common set up in point pattern analysis is what we call window sampling. Instead of observing the entire pattern we observe a subset of this pattern in the so called window. An example could be different small microscopy windows through which a big tissue slice is observed. In this case, it would be wrong to assume the window to be the convex hull around the observed points because they are just a sample of the bigger point pattern (Baddeley, Rubak, and Turner 2015, 143–45).
There is another concept called the small world model. It assumes that points can only be observed in a finite small world and not beyond these boundaries. When thinking of an entire tissue, this is a very common scenario. Cells can only be observed within the tissue and not beyond. In this case, it would be correct to not assume a rectangular observation window but to use methods to estimate an unknown sampling window such as the Ripley-Rasson estimate of a spatial region (Baddeley, Rubak, and Turner 2015, 144–45; Ripley and Rasson 1977).
setRiprasWindows <- function(pp){
Window(pp) <- ripras(pp)
return(pp)
}
#the entire point patterns with the ripras windows
pp <- lapply(pp_ls, setRiprasWindows)
separateMarks <- function(pp){
#split the multitype point process into several single type processes
ppls <- split(pp)
return (ppls)
}
#the point patterns separated by their marks
pp_ls <- lapply(pp, separateMarks)Complete spatial randomness (CSR) is often used as the null model for various point patterns. It is the result of a Poisson process. A completely spatial random process is characterised by two properties, homogeneity and independence, as discussed below (Baddeley, Rubak, and Turner 2015, 132).
“Homogeneity […] means that the expected number of points falling in a region B should be proportional to its area |B|” (Baddeley, Rubak, and Turner 2015, 132) given a proportionality constant \(\lambda\). The constant \(\lambda\) represents the intensity of the process, i.e., the average number of points in a unit area (Baddeley, Rubak, and Turner 2015, 132–33). :
\[ \mathbb{E}[X\cap B] = \lambda |B|. \label{eq:expected_number_points} \]
Independence implies that in two (non-overlapping) regions \(A\) and \(B\), the number of points \(n(X\cap A)\) and \(n(X\cap B)\) are independent random variables. In other words, the number of points in region \(A\) does not affect the number of points in region \(B\). (Baddeley, Rubak, and Turner 2015, 133).
A Poisson process that is spatially varying in its average density of points is called inhomogeneous. Here, the average density, \(\lambda(u)\), sometimes known as the intensity function (see below), is a function of the spatial location \(u\). In this case, the expected number of points falling into a region \(B\), \(\mu = n(X\cap B)\), is an integration of the intensity function over that region (Baddeley, Rubak, and Turner 2015, 138).
\[ \mu = \int_{B} \lambda(u) du. \label{eq:expected_number_inhomogeneous} \]
A point process is called isotropic, if its statistical properties are invariant to rotations; a CSR process is both stationary and isotropic (Baddeley, Rubak, and Turner 2015, 147).
“A point process is called stationary if, when we view the process through a window W , its statistical properties do not depend on the location of the window in two-dimensional space.” (Baddeley, Rubak, and Turner 2015, 146). This is the case for any homogeneous point process, where the statistical properties of the pattern are unchanged given shifting of the observation window. This means it is stationary in all statistical properties; first-order properties (e.g. intensity) and second-order properties (e.g. correlation) (Baddeley, Rubak, and Turner 2015, 218). Not all metrics assume stationarity in its full sense. Inhomogeneous metrics only assume second-order / correlation stationarity. That means while the intensity function can vary spatially (first-order stationarity is not given), the estimates of correlation functions (e.g. the inhomogeneous K-function) should be the same in parts of the window (Baddeley, Rubak, and Turner 2015, 689 ff.).
If a process is not correlation stationary, so the estimates of the inhomogeneous metric vary between locations, locally-scaled versions of the metric can be applicable. This means in subregions, the process is still stationary and isotropic, but there is a rescaling factor that can vary across the total process (Baddeley, Rubak, and Turner 2015, 246–47).
We can use a permutation test to test the inhomogeneity assumption. In this scenario, we split the patterns into quadrats and compare the estimated functions between the quadrats. It should be noted that this test depends on the arbitrary definition of the quadrats. Given our chosen patterns are not independent but result as marks from an overall point-pattern, the permutation approach is questionable (Baddeley, Rubak, and Turner 2015, 689–93).
permutation_test <- function(pp, mark, split, minpoints) {
pp_sel <- subset(pp, marks %in% mark, drop = TRUE)
rho_est <- rhohat(unmark(pp_sel), "x", method="tr")
lambda <- predict(rho_est)
tesselation <- quantess(unmark(pp_sel), "x", 3)
tesselation_split <- nestsplit(pp_sel, tesselation, ny=split)
plot(tesselation_split, main = mark)
tesselation_split$inten <- factor(as.integer(tesselation_split$f1) <= 1, labels=c("Hi","Lo"))
res.scaled <- studpermu.test(tesselation_split, pts ~ inten, summaryfunction=Kscaled,
minpoints = minpoints)
res.inhom <- studpermu.test(tesselation_split, pts ~ inten, summaryfunction=Kinhom,
lambda=lambda, minpoints = minpoints)
#p-value of the local-scaling test
print(paste0(mark,' local scaling test ', res.scaled$p.value))
#p-value of the inhomogeneity test
print(paste0(mark,' inhomogeneity test ', res.inhom$p.value))
}
lapply(c("Microglia", "OD Mature", "Ependymal"), function(x) permutation_test(pp[['0.01']], x, split = 3, minpoints = 10))The p-value of the test for local scaling for microglia cells is \(<0.05\) which indicates that the assumption of local scaling is rejected. Therefore, the distribution of microglia cells is not a scaled version of an overall distribution pattern. The p-value of the test for inhomogeneity for both microglia cells is \(>0.05\) indicating that the assumption of correlation stationarity is not rejected. In this case we can use the inhomogeneous version of the K-function which assumes correlation stationarity.
For ependymal and OD mature cells however, the p-values for both the local scaling test and the inhomogeneity test are \(>0.05\) which means that for this choice of quadrats both the correlation stationarity assumption and the local scaling assumption can’t be rejected.
As the interpretation of the permutation test is highly dependent on the quadrats, the results should be interpreted with care. Both inhomogeneous and locally scaled versions of the summary functions have support and both offer interesting insights into the spatial pattern. Therefore, we will compare all versions and show what the choice of metrics means for their interpretation.
Intensity is the expected density of points per unit area. It can be interpreted as the rate of occurrence or the abundance of events. The intensity represents a first order property because it is related to the expected number of points. More formally the average intensity of a point process is defined as:
\[ \bar{\lambda} = \frac{n(x)}{|W|} \label{eq:average_intensity} \]
As this is an average over the entire window, it is a good estimate for a homogeneous point process (Baddeley, Rubak, and Turner 2015, 157–60)
For a homogeneous point process, the intensity can be estimated in a simplistic way: summing the individual intensities of the marks (Baddeley, Rubak, and Turner 2015, 161).
[1] 0.001909
Otherwise, we can compute the intensity for each mark individually.
In kernel estimation, we try to estimate the intensity function \(\lambda(u)\) of the point process. There is a wide variety of kernel estimators (see (Baddeley, Rubak, and Turner 2015, 168)), but a popular choice is the isotropic Gaussian kernel where the standard deviation corresponds to the smoothing bandwidth (Baddeley, Rubak, and Turner 2015, 168).
In quadrat counting, all points falling into a given quadrat are counted. This gives an overview on the characteristics of the point pattern, such as correlation stationarity (Baddeley, Rubak, and Turner 2015, 163).
Under independence assumptions, the quadrat counts can be used for testing homogeneity, i.e., whether the points are distributed evenly across the quadrats (Baddeley, Rubak, and Turner 2015, 164–65).
Conditional Monte Carlo test of CSR using quadrat counts
Test statistic: Pearson X2 statistic
data: pp_ls[["0.01"]]$`OD Mature`
X2 = 635.09, p-value = 1
alternative hypothesis: regular
Quadrats: 25 tiles (irregular windows)
A p-value of 1 indicates that the null hypothesis of irregularity can not be rejected strongly. Thus, the point pattern of oligodendrocyts is strongly irregular.
Alternatively, we can inspect departures from the hypothesis that points were generated by a Poisson process. We can identify hotspots and coldspots by comparing the standard error of the relrisk function, which computes nonparamatric estimates of the relative risk by kernel smoothing, to the theoretical null distribution of points. The relative risk is the ratio of spatially varying probablilities of different types (Buller 2020).
Whether or not a point process is completely spatially random (CSR) depends on two characteristics: points need to be distributed homogeneously and they have to be independent of each other (see definitions above). There are various ways to test for CSR, here we show the use-case of the clark-evans test (Baddeley, Rubak, and Turner 2015, 165–66).
In the following document we will often compare the distribution of mature oligodendrocytes (OD mature cells) across different z-slices of the same tissue. We assume these slices to be enough apart to be considered as generated by different point processes. Since we consider the dependence of one mark among itself, we are in a within cell type setting per slide. We compare several curves along different z-slices, which is in turn a multivariate comparison (Baddeley, Rubak, and Turner 2015, 564 ff.).
In our example dataset we analyse the mouse preoptic hypothalamus (Moffitt et al. 2018). The lower boundary is the end of the tissue whereas the upper three boundaries are a technical boundary. Therefore, our example is a mixture between window sampling and the small world model. In order to decrease the bias of the tissue border, we use the Ripley-Rasson estimate of a spatial domain to estimate the sampling window (Baddeley, Rubak, and Turner 2015, 55; Ripley and Rasson 1977).
[[1]]
Symbol map with constant values
cols: #000000E4
[[2]]
Symbol map with constant values
cols: #000000C2
[[3]]
Symbol map with constant values
cols: #0000007B
null device
1
Correlation, or more generally covariance, represents a second-order summary statistic and measures dependence between data points (Baddeley, Rubak, and Turner 2015, 199 ff.).
With Ripley’s \(K\) we measure “the average number of \(r\)-neighbours of a typical random point” (Baddeley, Rubak, and Turner 2015, 204). This number is still dependent on the size of the observation window so we can normalise it by the number of points \(n\) and the window size, \(|W|\). We then obtain the empirical Ripley’s \(K\) function (Baddeley, Rubak, and Turner 2015, 204; Ripley 1977, 1976):
\[ \hat{K}(r) = \frac{|W|}{n(n-1)}\sum_{i=1}^n\sum_{j=1 \\j \neq i}^n\{d_{ij}\leq r\} e_{ij}(r) \]
The standardisation makes it possible to compare point patterns with different observation windows and with different numbers of points. However, using the empirical \(K\) function assumes that the point process has homogeneous intensity, which is often not the case for biological tissue (Baddeley, Rubak, and Turner 2015, 204–5). We will return to this issue below in the Correcting for Inhomogeneity. The term \(e_{ij}(r)\) is an edge correction. We will briefly cover this in Edge effects and their corrections for spatial metrics
Edge effects describe the phenomenon that not the entire point process is observed, but rather only the part within the window \(W\). This means the value of various statistics could be biased along the edges (Baddeley, Rubak, and Turner 2015, 213 ff.).
There are many corrections for edge effects that are briefly listed here (Baddeley, Rubak, and Turner 2015, 214–19):
Border correction:
In border correction the summation of data points is restricted to \(x_i\) for which \(b(x_i,r)\) is completely in the window \(W\).
Isotropic correction:
We can regard edge effect as a sampling bias. Larger distances (e.g. close to the edges) are less likely to be observed. This can be corrected for.
Translation correction:
A stationary point process \(X\) is invariant to translations. So the entire point process can be shifted by a vector \(s\) to be at the position \(X+s\).
The \(K\)-function can be ``centered’’, which is then referred to as Besag’s \(L\)-function. The \(L\)-function is a variance-stabilising version of the \(K\)-function (Canete et al. 2022; Besag 1977, 1977):
\[ L(r) = \sqrt{\frac{K(r)}{\pi}}. \]
By taking the square root the variance is approximately constant across the domain (Bartlett 1947).
We have seen above that the \(K\)-function is cumulative. That is, the contributions of all distances smaller equal to \(r\) are considered. An alternative is to take the derivative of the \(K\)-function in order to obtain contributions of distances between points equal to \(r\), according to:
\[ g(r) = \frac{K'(r)}{2\pi r}, \]
where the derivative of the \(K\) function divided by the probability of a Poisson process at this radius (Baddeley, Rubak, and Turner 2015, 225 ff.).
res_ls <- lapply(list('Kest', 'Lest', 'pcf'), function(fun){
res <- calcMetricPerFov(spe, 'OD Mature', subsetby = 'sample_id', fun = fun, marks = 'cluster_id', r_seq=NULL, by = c('Animal_ID','sample_id'))
res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
return(res)
})
p_ls <- lapply(res_ls, function(res){plotMetricPerFov(res, theo = TRUE, correction = "iso", x = "r", image_id = 'sample_id')})
In the homogeneous case we see that all slices are above the Poisson line, indicating positive association for all slices. The association is strongest for slice $-0.09$ followed by $0.01$ and $0.21$. Interestingly, at radii \(r>300\) the \(K\) curve for slice \(0.21\) crosses the other two curves.
In the case that a spatial pattern is known or suspected to be inhomogeneous, we have to take this into account in the analysis. Inhomogeneous alternatives can be estimated via:
\[ \hat{K}_{inhom}(r) = \frac{1}{D^p|W|}\sum_i\sum_{j \neq i} \frac{\mathbb{1}\{||u-x_j||\leq r\}}{\hat{\lambda}(x_j)\hat{\lambda}(x_i)}e(x_j,x_i;r), \]
where \(e(u,v;r)\) is an edge correction weight and \(\hat{\lambda}(x_i)\) is an estimator of the intensity at point \(x_i\). The estimation of the local intensities can happen in a data-dependent manner via kernel-smoothing. As this is the same data to then calculate the metric on, this can lead to biases. However, if the local intensities are known, the inhomogeneous \(K\) function is an unbiased estimator (Baddeley, Rubak, and Turner 2015, 242–46).
res_ls <- lapply(list('Kinhom', 'Linhom'), function(fun){
res <- calcMetricPerFov(spe, 'OD Mature', subsetby = 'sample_id', fun = fun, marks = 'cluster_id', r_seq=NULL, by = c('Animal_ID','sample_id'))
res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
return(res)
})
res_pcf <- calcMetricPerFov(spe, 'OD Mature', subsetby = 'sample_id', fun = 'pcfinhom', marks = 'cluster_id', r_seq=NULL, by = c('Animal_ID','sample_id'))
res_pcf <- subset(res_pcf, sample_id %in% c('-0.09', '0.01', '0.21'))
p_ls <- lapply(res_ls, function(res){plotMetricPerFov(res, correction = "border", theo = TRUE, x = "r", image_id = 'sample_id')})
p <- plotMetricPerFov(res_pcf, correction = "iso", theo = TRUE, x = "r", image_id = 'sample_id')The inhomogeneous \(K\)-function indicates that the oligodendrocytes of slice 0.21 are close to a Poisson process (dashed line) and can therefore be assumed to be randomly distributed and not clustered. The two other slices show a slightly stronger association among the oligodendrocytes than the slice 0.21.
The \(L\) function is complementary to the \(K\) function in this example.
The pair correlation function is the derivative of the \(K\)-function. The pcf plot gives similar information as before: Oligodendrocytes show the strongest association at \(\sim r = 25\) whereas the association is weaker in slice 0.21.
Interestingly, the curves for the inhomogeneous functions of slices -0.09 and 0.01 cross the Poisson line at \(r\sim350\). This means that the inhomogeneous functions find repulsion of slices past a radius of \(350\).
In the inhomogeneous \(K\)-function approach above, we assume that the intensities can vary locally but the scale of the point process is not changed. This means that while the intensities might vary in the parts of the point pattern, the pattern in one subquadrat is not just a scaled version of another subquadrat (Baddeley, Rubak, and Turner 2015, 246–47; Prokešová, Hahn, and Jensen 2006).
To account for this local scaling, we can assume that the process is subdivided into small regions. In these small regions, the point process is a scaled version of a template process. This template process needs to be both stationary and isotropic (Baddeley, Rubak, and Turner 2015, 246–47).
Since the \(L\)-function is simply a transformation of the \(K\)-function, the same local scaling framework can be applied to the \(L\)-function (Baddeley, Rubak, and Turner 2015, 246–47).
res_ls <- lapply(list('Kscaled', 'Lscaled'), function(fun){
res <- calcMetricPerFov(spe, 'OD Mature', subsetby = 'sample_id', fun = fun, marks = 'cluster_id', r_seq=NULL, by = c('Animal_ID','sample_id'))
res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
return(res)
})
p_ls <- lapply(res_ls, function(res){plotMetricPerFov(res, correction = "iso", theo = TRUE, x = "r", image_id = 'sample_id')})We see, that in slice 0.01 oligodendrocytes are far from the Poisson process line, indicating a strong association. The other two slices show a less strong association.
In the plot above we see all variants of the correlation metrics. The assumptions of either homogeneity (first row), inhomogeneity (second row) and local scaling (third row) change the interpretation of these example point patterns. In summary, in this example homogeneous variants show a positive association for all slices whereas inhomogeneity infered a Poisson distribution for slice \(0.21\). Furthermore, the inhomogeneous variant estimated repulsion for slices \(-0.09\) and \(0.01\), whereas homogeneous variants estimated clustering for all radii, whereas slice \(0.21\) became stronger than the other two slices past \(r>350\). The locally scaled version showed positive associations for all slices and no crossing of curves over the radii.
Deciding whether a pattern is homogeneous or inhomogeneous depends on the biological question. We provide some recommendations in the “Getting started” vignettes.
It is worth noting that the \(K\)- and \(L\)-functions described above are summary statistics over the entire pattern (i.e., averaged over all points). However, if we know that there are different regions in our point pattern, an alternative strategy is to compute `local` contributions to these patterns, i.e., local \(K\)- ,\(L\)- or pair-correlation functions. Baddeley et. al. propose to compare these \(n\) functions with so-called functional principal component analysis (see below). We will show here the example of the LISA version of the \(L\)-function (Baddeley, Rubak, and Turner 2015, 247–48).
L_odmature_lisa <- localL(pp_ls[['0.01']]$`OD Mature`)
df <- as.data.frame(L_odmature_lisa)
dfm <- reshape2::melt(df, "r")
get_sel <- dfm %>% filter(r > 200.5630 & r < 201.4388, variable != "theo") %>%
mutate(sel = value) %>% select(variable, sel)
dfm <- dfm %>% left_join(get_sel)
p <- ggplot(dfm, aes(x=r, y=value, group=variable, colour=sel)) +
geom_line() +
scale_color_continuous(type = "viridis") +
geom_vline(xintercept = 200) +
theme(legend.position = "none") +
theme_light() +
ggtitle("LISA curves of slice 0.01")
ppdf <- as.data.frame(pp[['0.01']]) %>% filter(marks=="OD Mature")
ppdf$sel <- get_sel$sel # assume they are in same order
q <- ggplot(ppdf, aes(x=x, y=y, colour=sel)) +
geom_point(size=0.75) +
scale_color_continuous(type = "viridis") +
theme(legend.position = "none") +
theme_light()+
ggtitle("Points coloured by LISA value at r ~ 200")In the case of the OD mature cells, we obtain further information with this plot. We note that there are two distinct populations of curves: those that are clearly above the mean LISA curve in black and others that are around/underneath. This indicates that there are two different kinds of interactions in the OD mature cells. Stronger and less clustered regions.
There are inhomogeneous versions of these (e.g. localLinhom) that are not shown here for brevity.
We apply functional PCA to retrieve the main trends in these individual curves. The idea of functional PCA is the same as for ordinary PCA but applied to functional data (i.e., each observation is a function). For the \(n\) functions above, functional PCA will recover the main trends in the data (Ramsay and Silverman 2005). We use the R package refund to perform functional PCA (Xiao et al. 2016).
#normalise the data
df_fdob <- asinh(df %>% as.matrix / 50) %>% as.data.frame()
# extract the functional response matrix
mat <- df_fdob %>%
select(!c(r,theo)) %>%
t()
# create a dataframe as required by pffr
dat <- data.frame(ID = rownames(mat))
dat$Y <- mat
dat$sel <- get_sel$sel
# perform functional PCA
res <- functionalPCA(dat = dat, r = df_fdob$r |> unique(), knots = 30, pve = 0.99)
# extract the scores
scores_df <- res$scores %>% as.data.frame()
# plot a biplot
p_biplot <- ggplot(scores_df, aes(scores_df[, 1], scores_df[, 2], colour = (dat[['sel']]))) +
geom_point() +
coord_equal() +
theme_light() +
scale_color_continuous(type = "viridis") +
xlab('PC1') +
ylab('PC2')p1 <- ggplot(ppdf, aes(x=x, y=y, colour = res$scores[,1])) +
scale_color_continuous(type = "viridis", name = 'loading PC1') +
theme_light() +
geom_point(size=0.75)
p2 <- ggplot(ppdf, aes(x=x, y=y, colour = res$scores[,2])) +
scale_color_continuous(type = "viridis", name = 'loading PC2') +
theme_light() +
geom_point(size=0.75)The biplot shows the distribution of the first two loadings of the functional PCA. The points are coloured as they were in the plots of the LISA \(L\)-curves. The first principal component clearly separates the two populations. In the last plot we project the loadings of the fPCs back onto the biological slices and find the same separation.
So far we have considered first- and second-order summary statistics and local (or inhomogeneous) adaptations of them. In the second order, one considers (counts of) pairs (e.g., \(K\) function). In a third-order setting, we would count triplets of points. A triplet is counted as the normalised expected value of triangles where all edges are smaller than the radius \(r\) (Baddeley, Rubak, and Turner 2015, 249).
So far, most approaches considered intensity and correlation as measures to assess a point pattern. Next, we will look at measures of spacing and shortest-distances to assess spatial arrangements (Baddeley, Rubak, and Turner 2015, 255).
Baddeley et al. summarises three basic distances:
Note also that there are tests of CSR that are based on spacing, including the Clark-Evans and Hopkins-Skellam Index tests that were discussed above ``Testing for CSR’’.
Nearest neighbour (NN) methods are based on the notion of “nearness”. In particular, we introduce nndist from spatstat, a method to calculate the distances until \(k\) NN are found. This function returns a density for each specified \(k\) for the \(k\) neighbour distances. We can for instance collapse the \(k\) curves into a mean curve per point pattern. This information of the mean nearest neighbour distance (MMND) can be summarised as a density. Note, that these distances are “raw” nearest-neighbour distances which are not corrected for edge effects. Edge correction for the nearest neighbour distance (\(k = 1\)) is implemented in the function Gest below (Baddeley, Rubak, and Turner 2015, 256) (Baddeley and Turner 2005).
nndistance <- function(pp, nk){
xy <- cbind(pp$x, pp$y)
nndistances_k15 <- nndist(xy, k = nk)
nndistances_mean <- rowMeans(nndistances_k15)
return(nndistances_mean)
}
#PRE: list of point pattern, corresponding celltypes of interest, functions to evaluate
#POST: result of the metric
metricRes_nndist <- function(ppls, celltype, fun){
metric.res <- list(res = do.call(fun, args = list(pp=ppls[[celltype]], nk = seq(1:15))))
metric.res$type = celltype
return(metric.res)
}
# [MR: again, this function looks again like those before and maybe could be done as an all-in-one wrapper.]
celltypes <- c("Ependymal", "OD Mature", "Microglia")
#go through all defined celltypes and calculate the nearest-neighbour distance
res_ls <- lapply(celltypes, metricRes_nndist, fun = nndistance, ppls = pp_ls[['0.01']])
#initialise a dataframe for the metric values and the type information
res_df <- data.frame(metric = numeric(0), type = character(0))
# Loop through the res_ls list and combine the metric values with their corresponding type - ChatGPT
for (i in 1:length(res_ls)) {
metric_values <- res_ls[[i]]$res
metric_type <- rep(res_ls[[i]]$type, length(metric_values))
df <- data.frame(metric = metric_values, type = metric_type)
res_df <- rbind(res_df, df)
}
#plot the densities
p <- ggplot(res_df, aes(x=metric, col= type))+
geom_density(linewidth=1)+
scale_x_sqrt() +
theme_light() +
ggtitle('Sqrt of the Mean Nearest-Neighbour Distance')
pIn the MNND empirical distribution, the ependymal cells show the shortest NN distances, a reflection of their clustering. The OD mature cells have larger NN distances as well as a bimodal distribution, indicating a mix of shorter and longer distances (as visible in the LISA \(L\)-functions). Microglia cells show the widest distances and the symmetry of the curve indicates similar distances throughout the field of view.
In a stationary spatial point process, the empty-space distance is defined as:
\[ d(u,X) = \min\{||u-x_i||: x_i \in X\} \]
Note that this is an edge-corrected distribution function of the nearest-neighbour distance above.
The empty space function is then the cumulative distribution function of the empty-space distances defined above:
\[ F(r) = \mathbb{P}\{d(u,X)\leq r\}. \]
The NN distance is defined as:
\[ d_i = \min_{j\neq i}||x_j-x_i||. \]
The NN distance distribution function \(G(r)\) is then defined as:
\[ G(r) = \mathbb{P}\{d(x,X\backslash u \leq r |X\ has\ a\ point\ at\ u\}. \]
For a homogeneous Poisson process, the NN distance distribution is identical to the empty-space function of the same process:
\[ G_{pois} \equiv F_{pois}. \]
For a general point process, the \(F\) and \(G\) functions are different (Baddeley, Rubak, and Turner 2015, 261–67).
The \(F\) and \(G\) functions are, like the \(K\) function, cumulative. The same disadvantages as with the \(K\) function occur here too, namely their cumulative nature. Therefore, an analogue to the pair-correlation function would make sense to consider. For practical reasons, this is no longer the derivative of the \(F\) function but rather a hazard rate:
\[ h(r) = \frac{f(r)}{1-F(r)}. \]
The concepts of the empty-space function \(F\) and the NN function \(G\) are complementary. If one decreases, the other increases.
Thus, the \(J\) function is a combination of both functions:
\[ J(r) = \frac{1-G(r)}{1-F(r)}. \]
For a CSR process, \(J_{pois} \equiv 1\), whereas values of \(J(r) > 1\) are consistent with a regular (e.g., repelling) pattern, and $J(r) < 1 represents a clustered process (Baddeley, Rubak, and Turner 2015, 275–77).
res_ls <- lapply(list('Gest', 'Fest', 'Jest'), function(fun){
res <- calcMetricPerFov(spe, 'OD Mature', subsetby = 'sample_id', fun = fun, marks = 'cluster_id', r_seq=NULL, by = c('Animal_ID','sample_id'))
res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
return(res)
})
p_ls <- lapply(res_ls, function(res){plotMetricPerFov(res, correction = "rs", theo = TRUE, x = "r", image_id = 'sample_id')})There are inhomogeneous variants of the spacing functions explained above (Baddeley, Rubak, and Turner 2015, 277–78)
res_ls <- lapply(list('Ginhom', 'Finhom', 'Jinhom'), function(fun){
res <- calcMetricPerFov(spe, 'OD Mature', subsetby = 'sample_id', fun = fun, marks = 'cluster_id', r_seq=NULL, by = c('Animal_ID','sample_id'))
res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
return(res)
})
p_ls <- lapply(res_ls, function(res){plotMetricPerFov(res, correction = "bord", theo = TRUE, x = "r", image_id = 'sample_id')})The inhomogeneous curves look different to their homogeneous counterparts but the relative ordering is of the curves per plot is the same.
Next to the NN distance, we can estimate the orientation of the neighbours, which gives an indication of the orientation of the spacing. It works by taking the angle between each point and its \(k^{th}\) nearest neighbour. The angle is anticlockwise from the x-axis (Baddeley, Rubak, and Turner 2015, 278–79) (Baddeley and Turner 2005).
res <- calcMetricPerFov(spe, 'OD Mature', subsetby = 'sample_id', fun = 'nnorient', marks = 'cluster_id', r_seq=NULL, by = c('Animal_ID','sample_id'))
res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
p <- plotMetricPerFov(res, correction = "bordm", theo = TRUE, x = "phi", image_id = 'sample_id')The values of \(\phi\) correspond to the orientation of the original point pattern. The horizontal axis goes from \(180\) to \(0\) (left to right) and the vertical from \(90\) to \(270\) (top to bottom)
an easier representation of the above metric can be plotted as a rose diagram
[[1]]
NULL
[[2]]
NULL
$`-0.09`
window: rectangle = [-0.004850711, 0.004850711] x [-0.004850711, 0.004850711]
units
$`0.01`
window: rectangle = [-0.005200848, 0.005200848] x [-0.005200848, 0.005200848]
units
$`0.21`
window: rectangle = [-0.004397865, 0.004397865] x [-0.004397865, 0.004397865]
units
null device
1
The two plots are complementary and show which are the preferred orientations of the point patterns. Furthermore, they show whether or not the assumption of isotropy (no change in the statistical properties of a point pattern after rotations) is justified or not (Baddeley, Rubak, and Turner 2015, 236 ff.). Isotropy is an assumption that a lot of spatial metrics make and in our example we note, that the point patterns are in fact anisotropic. An option for analysing anisotropic stationary point patterns is to not calculate the metric on the actual point pattern but rather calculating it on the fry plot of the point pattern. This generalises e.g. Ripley’s \(K\) function from circles to arbitrary shapes (Baddeley, Rubak, and Turner 2015, 239 ff.).
Note also that the concepts of spacing are not only usable in point pattern analysis but also more broadly in other spatial contexts (e.g., spacing between shapes instead of points) (Baddeley, Rubak, and Turner 2015, 279 ff.).
The same consideration about edge effects as for the \(K\) (and related) functions need to be made for the spacing functions; uncorrected estimates are negatively biased estimators. The easiest approach is to draw an artificial border and consider NNs within it. Other approaches are based on sampling. Yet another approach relates to survival analysis, with the idea is that a circle of a point to grows homogeneously with increasing radius until it hits the frame of the window and “dies”. This gives survival distributions similar to censored data, where the Kaplan-Meier estimator is the optimal choice (Baddeley, Rubak, and Turner 2015, 285–92).
In spatstat, a mark can basically take any value, discrete (e.g. cell types, as we have seen above) or continuous (e.g., gene expression) (Baddeley, Rubak, and Turner 2015, 637 ff.). In our example, we take the gene expression of some marker genes from Fig. 6 of the original publication (Moffitt et al. 2018). This is a typical numerical mark for points in a biological dataset.
Here we see spatial distribution of the counts of the three genes Slc18a2, Esr1 and Pgr. The size of the circles indicates the counts of the transcripts at that spot. Since there are really a lot of points, we can’t easily distinguish general patterns of count distributions.
The function pairs from spatstat generates a scatterplot of the counts of the marks (in our case the three genes) against each other and against the \(x\) and \(y\) coordinates. We can add a non-linear smoothing curve to make the general trends a bit more obvious (Baddeley, Rubak, and Turner 2015, 641).
We find that the counts of the three genes are very evenly distributed along the \(x\) and \(y\) coordinate, indicating a homogeneous distribution. The counts of Esr1 and Pgr are positively associated, indicating a dependence of these two marks.
NN interpolations uses the nearest mark to measure the intensity at each spatial location. This is conceptually similar to taking a very small bandwidth for Gaussian kernel smoothing (Baddeley, Rubak, and Turner 2015, 642).
We see that there is e.g. a clear spatial structure in the expression of e.g. Esr1. It shows a half moon shape.
As in the discrete case, summary functions assume that the point process is stationary.
The mark correlation function measures the dependence between two marks for two points at distance \(r\). It is applicable to stationary point processes with marks. The generalized mark correlation function is given by:
\[ k_f(r) = \frac{\mathbb{E}[f(m(u),m(v))]}{\mathbb{E}[f(M,M')]},\]
where \(f(m(u),m(v))\) is a test function with two arguments (representing the two marks at locations \(u\) and \(v\)) and returns a non-negative value. For continuous non-negative marks, the canonical choice for \(f\) is typically \(f(m(u),m(v))= m(u)m(v)\). \(M\) and \(M′\) represent independent, identically distributed random points with the same distribution as the mark of a randomly chosen point. This denominator is chosen such that random marks have a mark correlation of 1 (Baddeley, Rubak, and Turner 2015, 644–45).
From this plot we show that all genes show a positive correlation at small distances which decline with increasing radius \(r\). The association is strongest for the Slc18a2 gene. We can calculate simulation envelopes to estimate the significance of this association. This is not shown for brevity.
The mark-weigthed \(K\)-function is a generalization of the \(K\)-function in which the contribution from each pair of points is weighted by a function of their respective marks. It is given by:
\[K_f(r) = \frac 1 \lambda \frac{C_f(r)}{E[ f(M_1, M_2) ]},\] where:
\[ C_f(r) = E \left[ \sum_{x \in X} f(m(u), m(x)) 1\{0 < ||u - x|| \le r\} \; \big| \; u \in X \right], \]
is equivalent to the unnormalized mark-weighted \(K\)-function. For every point \(u\), we sum the euclidean distance \(||u - x||\) of all other points \(x\) that are within a distance \(r\). This sum is weighted by the function \(f(.,.)\) of the marks of \(u\) and \(x\). The function is standardized by the expected value of \(f(M_1, M_2)\) where \(M_1, M_2\) represent independent, identically distributed random points with the same distribution as the mark of a randomly chosen point (Baddeley, Rubak, and Turner 2015, 646–47).
In the scenario of random labeling, so where the marks are distributed randomly, the mark-weighted \(K\)-function corresponds to the standard Ripley’s \(K\)-function.
Also here, the canonical function is: \(f(m_1, m_2) = m_1 m_2\). This means we weigh each interaction between points by the product of the continuous marks of both points.
It is important to note that the theoretical value of the \(K\)-function is not very informative since it represents the \(K\)-function of a Poisson point process and the underlying point process might not be Poisson. Therefore we compare the mark-weighted with its unmarked analogue. Like this, we can assess whether the points weighted by a continuous mark are more or less correlated than their unmarked analogues (Baddeley, Rubak, and Turner 2015, 647).
Here we will compare the \(L\)-functions weighted by the mark of the gene Esr1 and the unmarked \(L\)-function.
We note that the difference between \(L\)-function weighted by the expression of Esr1 minus the unmarked \(L\)-function is positively different to the poisson difference, meaning that the expression of the continuous mark Esr1 is correlated among itself.
In this chapter we have looked at point pattern methods that can be applied to univariate marks. These univariate marks can either be discrete (in a molecular biological context that could be celltypes) or continuous (e.g. expression of a gene). There are approaches that summarise correlative dependencies among points or that consider spacing functions.
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] spatialFDA_0.99.0 magrittr_2.0.3
[3] stringr_1.5.0 dixon_0.0-8
[5] splancs_2.01-44 spdep_1.2-8
[7] spData_2.3.0 tmap_3.3-4
[9] scater_1.28.0 scran_1.28.2
[11] scuttle_1.10.3 SFEData_1.2.0
[13] SpatialFeatureExperiment_1.2.3 Voyager_1.2.7
[15] rgeoda_0.0.10-4 digest_0.6.33
[17] ncf_1.3-2 sf_1.0-16
[19] reshape2_1.4.4 patchwork_1.2.0
[21] STexampleData_1.8.0 ExperimentHub_2.8.1
[23] AnnotationHub_3.8.0 BiocFileCache_2.8.0
[25] dbplyr_2.3.4 RANN_2.6.1
[27] seg_0.5-7 sp_2.1-1
[29] rlang_1.1.1 ggplot2_3.5.1
[31] dplyr_1.1.3 mixR_0.2.0
[33] spatstat_3.0-6 spatstat.linnet_3.1-1
[35] spatstat.model_3.2-6 rpart_4.1.19
[37] spatstat.explore_3.2-3 nlme_3.1-162
[39] spatstat.random_3.1-6 spatstat.geom_3.2-5
[41] spatstat.data_3.0-1 SpatialExperiment_1.10.0
[43] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
[45] Biobase_2.60.0 GenomicRanges_1.52.1
[47] GenomeInfoDb_1.36.4 IRanges_2.34.1
[49] S4Vectors_0.38.2 BiocGenerics_0.46.0
[51] MatrixGenerics_1.12.3 matrixStats_1.0.0
loaded via a namespace (and not attached):
[1] spatstat.sparse_3.0-2 bitops_1.0-7
[3] httr_1.4.7 RColorBrewer_1.1-3
[5] grpreg_3.4.0 tools_4.3.1
[7] utf8_1.2.3 R6_2.5.1
[9] HDF5Array_1.28.1 mgcv_1.9-1
[11] rhdf5filters_1.12.1 withr_2.5.1
[13] gridExtra_2.3 leaflet_2.2.0
[15] textshaping_0.3.7 leafem_0.2.3
[17] cli_3.6.1 labeling_0.4.3
[19] mvtnorm_1.2-5 proxy_0.4-27
[21] systemfonts_1.0.5 R.utils_2.12.2
[23] dichromat_2.0-0.1 scico_1.5.0
[25] limma_3.56.2 rstudioapi_0.15.0
[27] RSQLite_2.3.1 generics_0.1.3
[29] crosstalk_1.2.0 Matrix_1.5-4.1
[31] ggbeeswarm_0.7.2 fansi_1.0.5
[33] abind_1.4-5 R.methodsS3_1.8.2
[35] terra_1.7-55 lifecycle_1.0.3
[37] yaml_2.3.7 edgeR_3.42.4
[39] rhdf5_2.44.0 tmaptools_3.1-1
[41] grid_4.3.1 blob_1.2.4
[43] promises_1.2.1 dqrng_0.3.1
[45] crayon_1.5.2 lattice_0.21-8
[47] beachmat_2.16.0 KEGGREST_1.40.1
[49] magick_2.8.0 refund_0.1-35
[51] pillar_1.9.0 knitr_1.44
[53] metapod_1.7.0 rjson_0.2.21
[55] boot_1.3-28.1 fda_6.1.8
[57] codetools_0.2-19 wk_0.8.0
[59] glue_1.6.2 vctrs_0.6.4
[61] png_0.1-8 gtable_0.3.4
[63] ks_1.14.2 cachem_1.0.8
[65] xfun_0.40 S4Arrays_1.0.6
[67] mime_0.12 DropletUtils_1.20.0
[69] pracma_2.4.2 fds_1.8
[71] pcaPP_2.0-4 pbs_1.1
[73] units_0.8-4 statmod_1.5.0
[75] bluster_1.10.0 interactiveDisplayBase_1.38.0
[77] ellipsis_0.3.2 bit64_4.0.5
[79] filelock_1.0.2 irlba_2.3.5.1
[81] vipor_0.4.5 KernSmooth_2.23-21
[83] colorspace_2.1-0 DBI_1.1.3
[85] raster_3.6-26 tidyselect_1.2.0
[87] bit_4.0.5 compiler_4.3.1
[89] curl_5.1.0 BiocNeighbors_1.18.0
[91] DelayedArray_0.26.7 scales_1.3.0
[93] classInt_0.4-10 rappdirs_0.3.3
[95] goftest_1.2-3 rainbow_3.8
[97] minqa_1.2.7 fftwtools_0.9-11
[99] spatstat.utils_3.0-5 rmarkdown_2.25
[101] XVector_0.40.0 htmltools_0.5.6.1
[103] pkgconfig_2.0.3 base64enc_0.1-3
[105] lme4_1.1-35.4 sparseMatrixStats_1.12.2
[107] fastmap_1.1.1 htmlwidgets_1.6.2
[109] RLRsim_3.1-8 shiny_1.7.5.1
[111] DelayedMatrixStats_1.22.6 farver_2.1.1
[113] jsonlite_1.8.7 mclust_6.1.1
[115] BiocParallel_1.34.2 R.oo_1.25.0
[117] BiocSingular_1.16.0 RCurl_1.98-1.12
[119] GenomeInfoDbData_1.2.10 s2_1.1.4
[121] Rhdf5lib_1.22.1 munsell_0.5.0
[123] Rcpp_1.0.11 ggnewscale_0.4.9
[125] viridis_0.6.4 stringi_1.7.12
[127] leafsync_0.1.0 MASS_7.3-60
[129] zlibbioc_1.46.0 plyr_1.8.9
[131] parallel_4.3.1 ggrepel_0.9.4
[133] deldir_1.0-9 Biostrings_2.68.1
[135] stars_0.6-4 splines_4.3.1
[137] tensor_1.5 locfit_1.5-9.8
[139] igraph_1.5.1 ScaledMatrix_1.8.1
[141] magic_1.6-1 BiocVersion_3.17.1
[143] XML_3.99-0.14 evaluate_0.22
[145] BiocManager_1.30.22 deSolve_1.40
[147] nloptr_2.1.1 httpuv_1.6.11
[149] tidyr_1.3.0 purrr_1.0.2
[151] polyclip_1.10-6 rsvd_1.0.5
[153] lwgeom_0.2-13 xtable_1.8-4
[155] e1071_1.7-13 RSpectra_0.16-1
[157] later_1.3.1 ragg_1.2.6
[159] viridisLite_0.4.2 class_7.3-22
[161] tibble_3.2.1 memoise_2.0.1
[163] beeswarm_0.4.0 AnnotationDbi_1.62.2
[165] gamm4_0.2-6 cluster_2.1.4
[167] hdrcde_3.4
©2024 The pasta authors. Content is published under Creative Commons CC-BY-4.0 License for the text and GPL-3 License for any code.